Abstract: Classifying text data has been an active area of research for a long time. A text document is a multifaceted object
and is often inherently ambiguous by nature. Multi-label learning deals with such ambiguous objects. Their ambiguity makes it
difficult for a classifier to assign relevant classes to an input document. Traditional single-label and multi-class text
classification paradigms cannot efficiently classify such a multifaceted text corpus. In this paper we propose a novel label
propagation approach based on semi-supervised learning for multi-label text classification. The proposed approach models the
relationship between class labels and also effectively represents input text documents. We use a semi-supervised learning
technique to make effective use of both labeled and unlabeled data for classification. The proposed approach promises better
classification accuracy and handling of complexity, and is evaluated on standard datasets such as Enron, Slashdot and Bibtex.
I. INTRODUCTION

The amount of textual data being produced through the internet is growing faster than the ability of information
consumers to search, digest and use it. Textual data is difficult to understand and categorize effectively because the
relationship between its sequence of words and its content is less clear than it is for numerical data. Such data includes
technical articles, memos, manuals, electronic mail, books, online newspapers, journal articles and many other forms of text.
Text classification has therefore become an active research topic. It assigns a document to a predefined category. Categories
may be represented numerically, or by a single word, a phrase, words with senses, etc. In the traditional approach, text was
classified manually by domain experts: a human expert had to read each input document and sort it into a predefined category
or set of categories. This approach requires extensive human effort and is also error prone, which led to automated text
classification. Automated text document classification eases the storage, searching and retrieval of relevant text documents
or their contents for the applications that need them. Three different paradigms exist in text classification: single-label
(binary), multi-class and multi-label. In the single-label case a new text document belongs to exactly one of two given
classes; in the multi-class case a new text document belongs to just one class of a set of m classes; and in the multi-label
case each document may belong to several classes simultaneously [3]. In practice many approaches exist and have been proposed
for the binary and multi-class cases, even though in many applications text documents are inherently multi-label in nature.
E.g., in medical diagnosis a report containing a set of symptoms can belong to many probable disease categories. The
multi-label text classification problem refers to the scenario in which a text document can be assigned to more than one
class simultaneously during classification. E.g., when classifying online news articles, news stories about the scams in the
Commonwealth Games in India can belong to classes like sports, politics, country-india, etc. The problem has attracted
significant attention from many researchers because it plays a crucial role in applications such as web page classification,
classification of news articles, information retrieval, etc.
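
The difference between the three paradigms shows up in how the target of one document is encoded. The following is a minimal sketch; the label vocabulary `LABELS` and the helper `to_indicator` are hypothetical names chosen for illustration, reusing the news example above.

```python
# Hypothetical label vocabulary, taken from the news example in the text.
LABELS = ["sports", "politics", "country-india"]


def to_indicator(label_set, labels=LABELS):
    """Encode a set of labels as a 0/1 vector ordered like `labels`."""
    return [1 if name in label_set else 0 for name in labels]


# Single label (binary): the document is in one of two classes.
binary_target = 1

# Multi-class: the document gets exactly one class out of m.
multiclass_target = "sports"

# Multi-label: the document may carry several classes at once.
multilabel_target = to_indicator({"sports", "politics", "country-india"})
```

In the multi-label case the target is a vector with one entry per class, so any subset of the classes can be expressed.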

Supervised methods from machine learning are generally used to realize multi-label text classification. Because
these need labeled data for classification all the time, semi-supervised methods are now also used in multi-label text
classifiers. Many approaches exist for implementing a multi-label text classifier. In this paper we propose a label
propagation approach for a multi-label text classifier; it uses existing label information to identify the labels of unlabeled
documents. We represent the input text document corpus in the form of a graph to exploit the ambiguity among different text
documents. The ambiguity is represented in the form of similarity measures, as a weighted edge between text documents. Within
the semi-supervised learning setting we focus not only on graph construction but also on sparsification and weighting of the
graph to improve the classifier's accuracy. We apply the proposed framework to standard datasets such as Enron, Bibtex and
Slashdot.

The rest of the paper is organized as follows. Section 2 describes literature related to semi-supervised learning
methods for multi-label text classification systems. Section 3 highlights the mathematical modeling of our approach. Section 4
describes our proposed label propagation approach for building a multi-label text classifier, followed by experiments and
results in Section 5 and a conclusion in the last section.
II. RELATED WORK

A multi-label text classifier can be realized by using supervised, unsupervised and semi-supervised methods of machine
learning. Supervised methods need only labeled text data for training. Unsupervised methods rely solely on unlabeled text
documents, whereas semi-supervised methods can effectively use unlabeled data in addition to the labeled data [1][2]. The
traditional approach to multi-label learning either decomposes the classification task into multiple independent binary
classification tasks or ranks the labels to find the relevant set of classes. These methods do not exploit the relationships
among class labels. A few popular existing methods are the binary relevance method, label power set method, pruned sets
method, C4.5, AdaBoost.MH and AdaBoost.MR, ML-kNN, classifier chains method, etc. [20]. All of these lack the capability of
handling unlabeled data, i.e. they are based on the principle of supervised learning.
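
Two of the decompositions named above, binary relevance and label power set, can be illustrated as plain data transformations. This is a minimal sketch under the usual textbook definitions; the function names and the toy label sets are assumptions made for illustration, and no classifier is trained here.

```python
def binary_relevance(targets, labels):
    """Turn label sets into one independent 0/1 target list per label."""
    return {lab: [1 if lab in t else 0 for t in targets] for lab in labels}


def label_powerset(targets):
    """Map each distinct label set to one class id of a multi-class problem."""
    class_ids = {}
    encoded = []
    for t in targets:
        key = frozenset(t)
        if key not in class_ids:
            class_ids[key] = len(class_ids)  # next unused class id
        encoded.append(class_ids[key])
    return encoded, class_ids


# Toy label sets for three documents (hypothetical).
targets = [{"a", "b"}, {"b"}, {"a", "b"}]
per_label = binary_relevance(targets, ["a", "b"])
as_classes, mapping = label_powerset(targets)
```

Binary relevance treats every label on its own, which is why it cannot use relationships between labels; label power set keeps co-occurring labels together, at the cost of one class per distinct label combination.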

While designing a multi-label text classifier, the major objective is not only to identify the set of classes to which
a new text document belongs but also to identify the most relevant among them, so as to improve the accuracy of the overall
classification process. Graph-based approaches are known for their effective exploration of document representation, and
semi-supervised methods exploit both labeled and unlabeled data for classification. That is why the accuracy of a multi-label
text classifier can be improved by using a graph-based representation of input documents in conjunction with the label
propagation approach of semi-supervised learning [16][17].

Table 1 summarizes a few well-known representative methods for multi-label text classification based on semi-supervised
learning; a few use only a graph-based framework and a few use both.



Table 1: Statistics of popular algorithms for MLTC based on semi supervised learning and graph based representation.

| Algorithm and year of proposal | Working theme | Datasets used for experimentation | Merits | Demerits |
|---|---|---|---|---|
| Multi-label classification by Constrained Non-Negative Matrix Factorization [2006] [8] | Optimization of class label assignment by using similarity measures and non-negative matrix factorization. | ESTA | Powerful representation of input documents using NMF; also works for large scale datasets. | Parameter selection is crucial. |
| Graph-based SSL with multi-label [2008] [9] | Exploits correlation among labels along with label consistency over the graph. | Video files: TRECVID 2006. | Effective utilization of unlabeled data. | Not applicable to text data; more effective on video data. |
| Semi supervised multi-label learning by solving a Sylvester Eq [2010] [10] | Graph construction for input documents and class labels. | Reuters | Improved accuracy. | May get slower on convergence. |
| Semi-Supervised Non negative Matrix Factorization [2009] [11] | Performs joint factorization of data and labels and uses multiplicative updates to perform classification. | 20-news, CSTR, k1a, k1b, WebKB4, Reuters | Able to extract more discriminative features. | High computational complexity. |


In the preprocessing stage, graph-based approaches can effectively represent the relationship between labeled and
unlabeled documents by identifying the structural and semantic relationships between them, for more relevant classification;
during the training phase, semi-supervised methods can propagate the labels of labeled documents to unlabeled documents based
on some energy function or regularizer. Our proposed work is based on the same strategy.

III. MATHEMATICAL MODEL OF PROPOSED SYSTEM 

In this section we introduce some notions related to text classification. We first represent the input document corpus
in the form of a graph. Graph construction converts the input text document corpus X to a graph G, i.e. $X \to G$, where X is
the input text document corpus $x_1, x_2, \ldots, x_n$ and each text document instance $x_i$ is in turn represented as an
m-dimensional feature vector. G is the overall graph structure $G = (V, E)$, where V is the set of vertices corresponding to
the document instances $x_i$ and E is the set of weighted edges between pairs of vertices, the edge weight corresponding to
the similarity between two documents. Generally a weight matrix W is computed to identify the similarity between pairs of text
documents. Various similarity measures such as cosine or Jaccard, or kernel functions K(.) such as the RBF kernel or Gaussian
kernel, can be used for this purpose.

We now define our graph-based multi-label text classifier system S as follows:
$S = \{X, Y, T, \hat{Y}, h\}$, where X is the entire input text document corpus $\{x_1, x_2, \ldots, x_n\}$. Of these, $|L|$
documents are labeled and the remaining are unlabeled. Y is the set of possible labels $\{Y_1, Y_2, \ldots, Y_n\}$. T is the
multi-label training set of the classifier, of the form $\{(x_1, Y_1), (x_2, Y_2), \ldots, (x_n, Y_n)\}$, where $x_i \in X$ is
a single document instance and $Y_i \subseteq Y$ is the label set associated with $x_i$. $\hat{Y}$ is the set of estimated
labels $\{\hat{Y}_l, \hat{Y}_u\}$. The goal of the system is to learn a function h, i.e.

$h : X \to 2^{Y}$ from T, which predicts the set of labels for the unlabeled documents, i.e. $x_{l+1}, \ldots, x_n$.

With this graph-based setting, we use semi-supervised learning to propagate labels on the graph from labeled nodes
to unlabeled nodes and compare the estimated labels $\hat{Y}$ with the true labels.

IV. PROPOSED APPROACH 

We mainly use the smoothness assumption of semi-supervised learning to propagate the labels of labeled documents to
unlabeled documents. The smoothness assumption states that "if two input points $x_1, x_2$ in a high-density region are close
to each other, then so should be the corresponding outputs $y_1, y_2$". Based on this, we emphasize exploiting the
relationships between input text documents in the form of a graph and the relationships between the class labels in the form
of a correlation matrix. The purpose is to reduce classification errors and to assign more relevant class labels to a new test
document instance.

During the classifier's training phase we compute the similarity between input documents to identify whether they are
in a high-density or low-density region. We evaluated the relationships between documents using the cosine similarity measure
and represented them in the form of a weight matrix W as:

$$W_{ij} = \exp\left(-\lambda \left(1 - \cos(d_i, d_j)\right)\right)$$

where $d_i$ and $d_j$ are two text documents represented in the feature space. A large cosine value indicates that the
documents are similar and a small value indicates that they are dissimilar.
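
The weighting above can be sketched in a few lines of plain Python. This is a minimal illustration, not the paper's jblas-based implementation: `lam` stands for the scaling factor in the exponent and is assumed here to be a positive constant, the diagonal is set to zero as in the algorithm summary later in this section, and the function names and toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two feature vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def edge_weights(docs, lam=1.0):
    """W[i][j] = exp(-lam * (1 - cos(d_i, d_j))) for i != j, and W[i][i] = 0."""
    n = len(docs)
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                W[i][j] = math.exp(-lam * (1.0 - cosine(docs[i], docs[j])))
    return W


# Toy term-count vectors: the first two documents are identical in direction.
docs = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
W = edge_weights(docs, lam=2.0)
```

With this weighting, documents pointing in the same direction get the maximum weight of 1, and the weight decays exponentially as the cosine similarity drops; a larger `lam` makes the decay sharper.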

After that we performed graph sparsification by representing the graph in the form of a diagonal matrix, in order to
reduce consideration of redundant data. To normalize the term $1 / (\|d_i\| \, \|d_j\|)$, we calculated the diagonal matrix D
as $D_{ii} = 1 / \sqrt{F(i) F(i)^{T}}$, where $F(i)$ is the i-th row vector of F. While identifying relationships between class
labels, we computed the correlation matrix $C_{m \times m}$, where m is the number of class labels, using the RBF kernel. Each
class is represented as a vector whose elements are 1 when the corresponding text document belongs to the class under
consideration.
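
A label correlation matrix of this kind could be computed as in the following minimal sketch. It assumes each class is encoded as a 0/1 indicator column over the labeled documents and that the RBF kernel takes the common form exp(-gamma * squared Euclidean distance); the function name, the `gamma` parameter and the toy matrix are illustrative assumptions, not values from the paper.

```python
import math


def label_correlation(Y, gamma=1.0):
    """Return the m x m matrix C with C[a][b] = exp(-gamma * ||y_a - y_b||^2).

    Y is an n x m 0/1 indicator matrix (rows: documents, columns: classes),
    and y_a denotes column a of Y, i.e. the vector representing class a.
    """
    n, m = len(Y), len(Y[0])
    columns = [[Y[i][a] for i in range(n)] for a in range(m)]
    C = [[0.0] * m for _ in range(m)]
    for a in range(m):
        for b in range(m):
            sq_dist = sum((p - q) ** 2 for p, q in zip(columns[a], columns[b]))
            C[a][b] = math.exp(-gamma * sq_dist)
    return C


# Toy data: classes 0 and 1 always co-occur, class 2 never occurs with them.
Y = [[1, 1, 0],
     [1, 1, 0],
     [0, 0, 1]]
C = label_correlation(Y)
```

Classes that are assigned to exactly the same documents get correlation 1, while classes that never co-occur get a value close to 0.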

Then, in the testing phase, in order to provide a relevant label set to an unlabeled document, we computed the energy
function E to measure the smoothness of label propagation. This energy function measures the difference between the weight
matrix W and the product of the sparsified diagonal matrix with the correlation matrix:

$$E = \sum W_{ij} - D\,C_{ij}$$

The labels are propagated based on the minimum value of the energy function. This reflects that if two text documents
are similar to each other, then the class labels assigned to them are also likely to be close to each other. In other words,
two documents sharing a highly similar input pattern are likely to be in a high-density region, so the classes assigned to
them are likely to be related and are propagated to those documents which in turn reside in the same high-density region.

After this label propagation phase, we obtained labels for all unlabeled document instances. We computed accuracy to
verify correct assignment of label sets; the corresponding results are given in Table III. We further verified the approach by
applying these documents and label sets to the existing classifier chains method, which is supervised in nature. We used a
decision tree (J48 in WEKA) and SVM (SMO and libSVM) separately as base classifiers and computed the results, which are given
in Table IV.

The summary of our proposed label propagation approach is given as:

Input - T: The multi-label training set $\{(x_1, Y_1), (x_2, Y_2), \ldots, (x_n, Y_n)\}$.
        z: The test document instance such that $z \in X$.

Output - The predicted label set for z.

Process:

1. Compute the edge weight matrix W as $W_{ij} = \arccos \frac{X_1 \cdot X_2}{|X_1| \, |X_2|}$ and assign $W_{ii} = 0$.
2. Sparsify the graph by computing the diagonal degree matrix D as $D_{ii} = \sum_j W_{ij}$.
3. Compute the label correlation matrix $C_{m \times m}$ using the RBF kernel method.
4. Initialize $\hat{Y}^{(0)}$ to the set $(Y_1, Y_2, \ldots, Y_l, 0, 0, \ldots, 0)$.
5. Iterate till convergence to $\hat{Y}^{(\infty)}$:
   1. $E = \sum W_{ij} - D^{-1} C_{ij}$
   2. Compute $\hat{Y}^{(t+1)}$
   3. $\hat{Y}^{(t+1)}_l = Y_l$
6. Label point z by the sign of $\hat{Y}^{(\infty)}$.
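
The initialise, iterate and clamp structure of the procedure above matches the generic label propagation scheme on a graph. The sketch below implements that generic scheme only, under stated assumptions: weights are row-normalised by the degree, labeled documents come first, labels are encoded as 0/1 so a 0.5 threshold replaces the sign test, and the label correlation matrix C and the energy function E are not used. All names and the toy graph are hypothetical.

```python
def propagate_labels(W, Y_labeled, n_iter=100, tol=1e-6):
    """Generic label propagation: returns an n x m matrix of label scores.

    W         : n x n non-negative edge weights (labeled documents first).
    Y_labeled : l x m 0/1 label matrix for the first l documents.
    """
    n, l, m = len(W), len(Y_labeled), len(Y_labeled[0])
    degree = [sum(row) for row in W]
    # Row-normalised transition matrix P = D^-1 W (isolated nodes stay at 0).
    P = [[W[i][j] / degree[i] if degree[i] > 0 else 0.0 for j in range(n)]
         for i in range(n)]
    # Labeled rows start from their labels, unlabeled rows from zeros.
    Y = [[float(v) for v in row] for row in Y_labeled]
    Y += [[0.0] * m for _ in range(n - l)]
    for _ in range(n_iter):
        new_Y = [[sum(P[i][k] * Y[k][c] for k in range(n)) for c in range(m)]
                 for i in range(n)]
        for i in range(l):  # clamp the labeled documents to their known labels
            new_Y[i] = [float(v) for v in Y_labeled[i]]
        change = max(abs(new_Y[i][c] - Y[i][c])
                     for i in range(n) for c in range(m))
        Y = new_Y
        if change < tol:
            break
    return Y


# Toy graph: document 2 is linked to labeled document 0, document 3 to 1.
W = [[0, 0, 1, 0],
     [0, 0, 0, 1],
     [1, 0, 0, 0],
     [0, 1, 0, 0]]
scores = propagate_labels(W, [[1, 0], [0, 1]])
predicted = [[1 if s >= 0.5 else 0 for s in row] for row in scores]
```

Clamping after every step keeps the known labels fixed, so information only flows from labeled to unlabeled nodes along the weighted edges.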



V. EXPERIMENTS AND RESULTS 

In this section, to evaluate our approach, we conducted experiments on three text datasets, namely Enron, Slashdot
and Bibtex, and measured the accuracy of the overall classification process. Table II summarizes the statistics of the
datasets used in our experiments.

Table II: Statistics of Datasets

| Dataset | No. of document instances | No. of labels | Attributes |
|---|---|---|---|
| Slashdot | 3782 | 22 | 500 |
| Enron | 1702 | 53 | 1001 |
| Bibtex | 7395 | 159 | 1836 |



The Enron dataset contains email messages; it is a subset of about 1700 labeled email messages [21]. The BibTeX
dataset contains metadata for BibTeX items, such as the title of the paper, the authors, etc. The Slashdot dataset contains
article titles and partial blurbs mined from Slashdot.org [22].

We used the accuracy measure proposed by Godbole and Sarawagi in [13]. It symmetrically measures how close $Y_i$ is
to $Z_i$, i.e. the estimated labels to the true labels. It is the ratio of the size of the intersection to the size of the
union of the predicted and actual label sets, taken for each example and averaged over the number of examples. The formula
used to compute accuracy is as follows:

$$\text{Accuracy} = \frac{1}{N} \sum_{i=1}^{N} \frac{|Y_i \cap Z_i|}{|Y_i \cup Z_i|}$$
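
The accuracy measure described above, the per-example ratio of intersection size to union size averaged over all examples, can be sketched as follows. The function name and toy label sets are hypothetical, and scoring an example whose predicted and actual sets are both empty as 1.0 is an assumption of this sketch.

```python
def multilabel_accuracy(predicted, actual):
    """Mean over examples of |Y_i & Z_i| / |Y_i | Z_i| for label sets Y_i, Z_i."""
    total = 0.0
    for y, z in zip(predicted, actual):
        y, z = set(y), set(z)
        union = y | z
        # Assumption: two empty label sets count as a perfect match.
        total += len(y & z) / len(union) if union else 1.0
    return total / len(predicted)


predicted = [{"a", "b"}, {"c"}]
actual = [{"a"}, {"c"}]
score = multilabel_accuracy(predicted, actual)  # (1/2 + 1/1) / 2
```

A prediction with one correct and one spurious label scores 0.5 for that example, so partial matches are rewarded rather than counted as plain errors.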



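We also computed precision, recall and F-measure values. The formula used to compute the F-measure is as follows:

$$\text{F-Measure} = \frac{2.0 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}}$$

$$\text{F-Measure} = \frac{1}{N} \sum_{i=1}^{N} \frac{2\,|Y_i \cap Z_i|}{|Z_i| + |Y_i|}$$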
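
Both forms of the F-measure can be sketched in plain Python: the harmonic mean of a given precision and recall, and the example-based average over predicted and actual label sets. The function names and toy data are hypothetical, and returning 1.0 for an example with two empty label sets (or 0.0 when precision and recall are both zero) is an assumption of this sketch.

```python
def f_from_precision_recall(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


def example_based_f_measure(predicted, actual):
    """Mean over examples of 2 * |Y_i & Z_i| / (|Y_i| + |Z_i|)."""
    total = 0.0
    for y, z in zip(predicted, actual):
        y, z = set(y), set(z)
        size_sum = len(y) + len(z)
        # Assumption: two empty label sets count as a perfect match.
        total += 2.0 * len(y & z) / size_sum if size_sum else 1.0
    return total / len(predicted)


predicted = [{"a", "b"}, {"c"}]
actual = [{"a"}, {"c"}]
f_value = example_based_f_measure(predicted, actual)  # (2/3 + 1) / 2
```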



We evaluated our approach under a WEKA-based [23] framework running on Java JDK 1.6 with the MEKA and Mulan libraries
[21][22], and the jblas library for performing matrix operations while computing weights on graph edges. Experiments ran on
64-bit machines with a 2.6 GHz clock speed, allowing up to 4 GB of RAM per iteration. Ensemble iterations are set to 10 for
EPS. Evaluation is done in the form of 5 x 2 fold cross validation on each dataset. We first measured the accuracy, precision,
recall and F-measure after the label propagation phase was over. Table III lists the values measured for each dataset.
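
The 5 x 2 fold cross validation protocol is commonly understood as five repetitions of a 2-fold split, each half serving once as training set and once as test set. The following is a minimal sketch of the index splitting under that reading; the function name and the `seed` parameter are hypothetical, and the actual experiments used the WEKA-based tooling rather than this code.

```python
import random


def five_by_two_splits(n_samples, seed=0):
    """Yield 10 (train_indices, test_indices) pairs: 5 shuffles x 2 folds."""
    rng = random.Random(seed)
    for _ in range(5):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        half = n_samples // 2
        first, second = indices[:half], indices[half:]
        yield first, second   # train on the first half, test on the second
        yield second, first   # then swap the roles of the two halves


splits = list(five_by_two_splits(10, seed=1))
```

Each repetition reshuffles the data, so the ten train/test pairs give ten measurements per dataset that can then be averaged.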



Table III: Results after Label Propagation Phase

| Evaluation Criterion | Enron | Slashdot | Bibtex |
|---|---|---|---|
| Accuracy | 90 | 89 | 92 |
| Precision | 50 | 49 | 48 |
| Recall | 49 | 47 | 46 |
| F-measure | 50 | 47 | 47 |



After the label propagation phase, we obtained labels for all unlabeled documents, so the entire dataset is now
labeled. We applied this labeled set to the ensemble of classifier chains (ECC) method, which is supervised in nature [24],
and measured accuracy, precision and recall with three different base classifiers: a decision tree (J48 in WEKA) and two
variations of the support vector machine (SMO in WEKA, libSVM). We also measured the overall testing and building time
required for this process. ECC is a proven and efficient supervised multi-label text classification technique; we verified our
entire final labeled dataset by giving it as input to ECC. The results are listed in Table IV.
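
The idea behind a classifier chain is that the classifier for label j sees the original features plus the labels 0..j-1 earlier in the chain. The following minimal sketch shows a single chain only, not the ensemble, and swaps in a 1-nearest-neighbour base learner purely to stay self-contained; it is not the J48, SMO or libSVM setup used in the experiments, and all names and data are hypothetical.

```python
def nearest_label(train_X, train_y, x):
    """1-nearest-neighbour base classifier using squared Euclidean distance."""
    best = min(range(len(train_X)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], x)))
    return train_y[best]


def chain_predict(train_X, train_Y, x):
    """Predict labels in chain order; label j also sees labels 0..j-1."""
    n_labels = len(train_Y[0])
    predicted = []
    for j in range(n_labels):
        # Augment training features with the true values of earlier labels.
        augmented_train = [row + labels[:j]
                           for row, labels in zip(train_X, train_Y)]
        # Augment the query with the labels predicted so far.
        augmented_x = list(x) + predicted
        targets = [labels[j] for labels in train_Y]
        predicted.append(nearest_label(augmented_train, targets, augmented_x))
    return predicted


# Toy training data: two documents, two labels that always agree.
train_X = [[0.0, 0.0], [1.0, 1.0]]
train_Y = [[0, 0], [1, 1]]
near_second = chain_predict(train_X, train_Y, [0.9, 0.8])
near_first = chain_predict(train_X, train_Y, [0.1, 0.0])
```

Because later links take earlier predictions as input, a chain can pick up dependencies between labels that independent binary classifiers would miss.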

Table IV: Result after using supervised multi label classifier

Dataset: Slashdot

| Parameters | BS: SMO | BS: libSVM | BS: J48 |
|---|---|---|---|
| LCard(training) | 2.42 | 2.27 | 2.23 |
| #(training Samples) | 1135.0 | 1135.0 | 1135.0 |
| LCard(testing) | 2.32 | 2.20 | 2.1 |
| #(testing Samples) | 756.0 | 756.0 | 756.0 |
| Test time | 69.5 | 28.9 | 29 |
| Build time | 173.9 | 2609.078 | 2546.17 |
| Recall | 0.40 | 0.56 | 0.53 |
| Threshold | 0.01 | 0.2 | 0.2 |
| F1 micro | 0.56 | 0.52 | 0.48 |
| Precision | 0.93 | 0.49 | 0.44 |
| Accuracy | 0.41 | 0.41 | 0.32 |




Dataset: Enron

| Parameters | BS: SMO | BS: libSVM | BS: J48 |
|---|---|---|---|
| LCard(training) | 2.42 | 2.26 | 2.23 |
| #(training Samples) | 1135.0 | 1135.0 | 1135.0 |
| LCard(testing) | 2.32 | 2.20 | 2.13 |
| #(testing Samples) | 756.0 | 756.0 | 756.0 |
| Test time | 69.516 | 28.89 | 28.97 |
| Build time | 173.891 | 2609.078 | 2546.17 |
| Recall | 0.40 | 0.56 | 0.53 |
| Threshold | 0.0010 | 0.2 | 0.2 |
| F1 micro | 0.56 | 0.52 | 0.48 |
| Precision | 0.94 | 0.49 | 0.44 |
| Accuracy | 0.41 | 0.41 | 0.32 |




Dataset: Bibtex

| Parameters | BS: SMO | BS: libSVM | BS: J48 |
|---|---|---|---|
| LCard(training) | 2.45 | 2.27 | 2.23 |
| #(training Samples) | 1135.0 | 1135.0 | 1135.0 |
| LCard(testing) | 2.32 | 2.20 | 2.13 |
| #(testing Samples) | 756.0 | 756.0 | 756.0 |
| Test time | 69.516 | 28.89 | 28.9 |
| Build time | 173.8 | 2609.078 | 2546.172 |
| Recall | 0.40 | 0.56 | 0.53 |
| Threshold | 0.0010 | 0.2 | 0.2 |
| F1 micro | 0.56 | 0.52 | 0.48 |
| Precision | 0.93 | 0.49 | 0.44 |
| Accuracy | 0.41 | 0.41 | 0.32 |

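
The LCard rows in Table IV presumably denote label cardinality, the average number of labels per document, which multi-label toolkits commonly report for the training and test splits. A minimal sketch under that assumption follows; the function name and toy label sets are hypothetical.

```python
def label_cardinality(label_sets):
    """Average number of labels assigned per document."""
    return sum(len(labels) for labels in label_sets) / len(label_sets)


# Toy label sets for three documents: 2 + 1 + 3 labels over 3 documents.
toy_sets = [{"a", "b"}, {"a"}, {"a", "b", "c"}]
cardinality = label_cardinality(toy_sets)
```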



VI. CONCLUSION AND FUTURE WORK 

We have proposed a novel label propagation based approach for multi-label classification. It works in a semi-supervised
learning setting by considering the smoothness assumption on data points and labels. The approach is evaluated using
small-scale datasets (Enron, Slashdot) as well as a large-scale dataset (Bibtex). It is also verified against a traditional
supervised method. Our approach shows significant improvement in accuracy by incorporating unlabeled data along with labeled
training data, but a significant amount of computational time is required to calculate the similarity among documents as well
as class labels. The input text corpus is well exploited as a graph; however, in the future the use of feature extraction
methods like NMF with latent semantic indexing may provide more stable results.